3 Description of the adopted methodology
The adopted methodology consists of data preparation, feature engineering and modelling. In the following sections these parts are described in more detail.
3.1 Preprocessing
In the preprocessing part we deal with the outliers and NA values of each building and of the weather data.
3.1.1 Outliers, NAs
3.1.1.1 Building 2
We’ve seen in the previous chapter that building 2 has some extreme outliers.
Let’s plot both the energy produced and the sun radiation together:
building_2 %>%
  plot_ly(
    x = ~timestamp,
    y = ~energy_produced,
    type = "scatter",
    mode = "lines",
    name = "energy produced"
  ) %>%
  add_trace(
    inherit = F,
    data = building_2_sun,
    x = ~timestamp,
    y = ~sun,
    type = "scatter",
    mode = "lines",
    opacity = 0.5,
    name = "sun"
  )

We can see that there is no extreme anomaly in the sunshine, so those outliers cannot be explained by it and are most likely erroneous values.
As in the other buildings the range of produced energy lies between 0 and 100, we decided to cut off the outliers here based on this criterion.
building_2 <- building_2[energy_produced < 100 & energy_produced >= 0]

3.1.1.2 Weather
In order to be able to merge the data later on, we have to change the type of date_time column to POSIXct:
weather[, date_time := as.POSIXct(date_time, format = "%d/%m/%Y")] # 31/08/2016

We notice that there is a column called “location”. This is due to the Weather API supporting downloads of data for multiple locations into the same dataset. As the location is always the same (Vienna), we decide to drop it.
weather[, location := NULL]

3.1.2 Merging the data
We have to merge each of our building datasets with the weather data we acquired, so that we can use features from the weather data in our forecasting. As we have two datasets for each building, one with sun data and the other with energy data, we first merge these two and then add the weather data. The merge is done on the timestamp, which is why we first need to aggregate the building data daily.
The process for every building is as follows:
3.1.2.1 Building 2
b2_joined <- left_join(building_2, building_2_sun,
by = c("timestamp" = "timestamp"))
summary(b2_joined)

## timestamp energy_produced sun
## Min. :2016-08-09 00:15:00 Min. :0.000 Min. : -0.0867
## 1st Qu.:2017-04-29 01:45:00 1st Qu.:0.000 1st Qu.: 0.0000
## Median :2018-01-17 04:00:00 Median :0.000 Median : 0.1733
## Mean :2018-01-18 21:45:30 Mean :0.457 Mean : 131.1446
## 3rd Qu.:2018-10-10 22:30:00 3rd Qu.:0.534 3rd Qu.: 177.2333
## Max. :2019-07-01 00:00:00 Max. :3.760 Max. :1089.6600
# aggregate daily
building2 <- b2_joined[, .(daily_total_sun=sum(sun), daily_total_energy=sum(energy_produced)),
by=.(Day=floor_date(timestamp, "days"))]
summary(building2)

## Day daily_total_sun daily_total_energy
## Min. :2016-08-09 00:00:00 Min. : 0 Min. : 0.00
## 1st Qu.:2017-04-29 06:00:00 1st Qu.: 4460 1st Qu.: 11.16
## Median :2018-01-17 12:00:00 Median :10695 Median : 40.00
## Mean :2018-01-18 22:01:08 Mean :12569 Mean : 43.80
## 3rd Qu.:2018-10-10 18:00:00 3rd Qu.:20393 3rd Qu.: 76.49
## Max. :2019-07-01 00:00:00 Max. :32033 Max. :110.75
building2_weather <- left_join(building2, weather,
by = c("Day" = "date_time"))

3.1.2.2 Building 5
b5_joined <- left_join(building_5, building_5_sun,
by = c("timestamp" = "timestamp"))
b5_joined <- na.omit(b5_joined) # remove the last day, because it contains NAs
summary(b5_joined)

## timestamp energy_produced sun
## Min. :2016-08-09 00:05:00 Min. : 0.000 Min. : -1.056
## 1st Qu.:2017-04-18 01:56:15 1st Qu.: 0.000 1st Qu.: 1.869
## Median :2017-12-26 03:47:30 Median : 0.000 Median : 12.919
## Mean :2017-12-26 03:47:30 Mean : 1.591 Mean : 146.211
## 3rd Qu.:2018-09-04 05:38:45 3rd Qu.: 2.032 3rd Qu.: 180.131
## Max. :2019-05-14 07:30:00 Max. :18.560 Max. :1286.025
# aggregate daily
building5 <- b5_joined[, .(daily_total_sun=sum(sun), daily_total_energy=sum(energy_produced)),
by=.(Day=floor_date(timestamp, "days"))]
summary(building5)

## Day daily_total_sun daily_total_energy
## Min. :2016-08-09 Min. : -37.57 Min. : 0.016
## 1st Qu.:2017-04-18 1st Qu.: 15909.80 1st Qu.: 141.027
## Median :2017-12-26 Median : 34925.39 Median : 388.544
## Mean :2017-12-26 Mean : 42080.02 Mean : 457.755
## 3rd Qu.:2018-09-04 3rd Qu.: 67842.69 3rd Qu.: 758.336
## Max. :2019-05-14 Max. :110381.37 Max. :1260.672
building5_weather <- left_join(building5, weather,
by = c("Day" = "date_time"))

3.1.2.3 Building 8
b8_joined <- left_join(building_8, building_8_sun,
by = c("timestamp" = "timestamp"))
summary(b8_joined)

## timestamp energy_produced sun
## Min. :2016-08-09 00:05:00 Min. :0.0000 Min. : 3.354
## 1st Qu.:2017-04-30 00:03:45 1st Qu.:0.0000 1st Qu.: 3.519
## Median :2018-01-19 00:02:30 Median :0.0000 Median : 3.675
## Mean :2018-01-19 00:02:30 Mean :0.3009 Mean : 132.371
## 3rd Qu.:2018-10-10 00:01:15 3rd Qu.:0.3760 3rd Qu.: 177.771
## Max. :2019-07-01 00:00:00 Max. :2.3920 Max. :1068.678
# aggregate daily
building8 <- b8_joined[, .(daily_total_sun=sum(sun), daily_total_energy=sum(energy_produced)),
by=.(Day=floor_date(timestamp, "days"))]
summary(building8)

## Day daily_total_sun daily_total_energy
## Min. :2016-08-09 Min. : 3.66 Min. : 0.00
## 1st Qu.:2017-04-30 1st Qu.: 14104.25 1st Qu.: 29.45
## Median :2018-01-19 Median : 32361.41 Median : 79.45
## Mean :2018-01-19 Mean : 38086.76 Mean : 86.58
## 3rd Qu.:2018-10-10 3rd Qu.: 61039.17 3rd Qu.:142.61
## Max. :2019-07-01 Max. :151064.61 Max. :214.67
building8_weather <- left_join(building8, weather,
by = c("Day" = "date_time"))

This is how the dataset looks now: it contains the amount of sun, the energy produced and all the other variables from the weather data.
3.2 Feature engineering
Some variables would not be important for our forecast. As we want to predict energy production, which highly depends on the amount of sun during the day, features like wind, temperature, humidity, pressure, visibility, moon illumination and moonrise are not really relevant in this case. Based on our understanding of the domain, important predictors for the energy are sun, UV index, sun hours and cloud cover, because these are the variables that influence the amount of sun and, at the same time, the amount of generated energy. Therefore, we decide to keep only these features and discard the others.
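A minimal sketch of this column selection on the merged building 2 table; the names of the kept measurement columns come from the summaries above, while the set of dropped weather columns is an assumption about the raw data:

```r
# keep only the date, the sun/energy measurements and the chosen weather
# predictors; every other weather column is dropped
keep_cols <- c("Day", "daily_total_sun", "daily_total_energy",
               "sunHour", "uvIndex", "cloudcover")
building2_weather <- building2_weather[, ..keep_cols]  # data.table selection by character vector
```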
Building 2:
# final building 2 aggregated weekly
building2 <- building2_weather[, .(sun=sum(daily_total_sun),
energy=sum(daily_total_energy),
sunHour=sum(sunHour),
uvIndex=sum(uvIndex),
cloudcover=sum(cloudcover)),
by=.(Week=floor_date(Day, "weeks"))]
building2

## Week sun energy sunHour uvIndex cloudcover
## 1: 2016-08-07 63058.36 235.3050 62.0 22 253
## 2: 2016-08-14 147175.98 565.9120 99.9 38 188
## 3: 2016-08-21 137723.81 538.1480 84.9 38 145
## 4: 2016-08-28 135092.62 538.9766 78.8 41 108
## 5: 2016-09-04 114039.42 472.4040 74.8 37 203
## ---
## 148: 2019-06-02 186802.58 638.3960 98.8 39 113
## 149: 2019-06-09 196700.09 662.3279 99.7 46 76
## 150: 2019-06-16 176269.59 606.3799 101.5 40 148
## 151: 2019-06-23 190691.92 634.7160 99.7 42 107
## 152: 2019-06-30 31125.16 103.2520 29.0 13 15
Building 5:
# final building 5 aggregated weekly
building5 <- building5_weather[, .(sun=sum(daily_total_sun),
energy=sum(daily_total_energy),
sunHour=sum(sunHour),
uvIndex=sum(uvIndex),
cloudcover=sum(cloudcover)),
by=.(Week=floor_date(Day, "weeks"))]
building5

## Week sun energy sunHour uvIndex cloudcover
## 1: 2016-08-07 202677.93 2865.728 62.0 22 253
## 2: 2016-08-14 479672.57 6170.416 99.9 38 188
## 3: 2016-08-21 447513.56 5510.431 84.9 38 145
## 4: 2016-08-28 429050.96 5751.301 78.8 41 108
## 5: 2016-09-04 342232.86 4642.368 74.8 37 203
## ---
## 141: 2019-04-14 532379.32 6168.128 82.9 28 182
## 142: 2019-04-21 505690.77 5872.640 92.2 32 239
## 143: 2019-04-28 372687.61 4444.544 86.6 23 359
## 144: 2019-05-05 401756.42 4984.448 88.0 23 365
## 145: 2019-05-12 55829.38 702.592 33.4 8 217
Building 8:
# final building 8 aggregated weekly
building8 <- building8_weather[, .(sun=sum(daily_total_sun),
energy=sum(daily_total_energy),
sunHour=sum(sunHour),
uvIndex=sum(uvIndex),
cloudcover=sum(cloudcover)),
by=.(Week=floor_date(Day, "weeks"))]
building8

## Week sun energy sunHour uvIndex cloudcover
## 1: 2016-08-07 195254.33 473.506 62.0 22 253
## 2: 2016-08-14 442996.59 1052.572 99.9 38 188
## 3: 2016-08-21 415648.10 986.530 84.9 38 145
## 4: 2016-08-28 404495.16 954.740 78.8 41 108
## 5: 2016-09-04 343209.16 831.634 74.8 37 203
## ---
## 148: 2019-06-02 561870.35 1203.696 98.8 39 113
## 149: 2019-06-09 591717.49 1262.832 99.7 46 76
## 150: 2019-06-16 364706.14 1150.704 101.5 40 148
## 151: 2019-06-23 569457.65 1186.232 99.7 42 107
## 152: 2019-06-30 93400.23 192.656 29.0 13 15
Creating ts objects so that we can use them for modelling:
building_2_ts <- ts(building2, frequency = 52, start = c(2016,8))
building_5_ts <- ts(building5, frequency = 52, start = c(2016,8))
building_8_ts <- ts(building8, frequency = 52, start = c(2016,8))

ggAcf(building_8_ts)
We can see that there is a strong seasonality: lags 0, 52 and 104, i.e. every 52nd lag, are well above the significance line. That means the series is strongly correlated with the previous years. If we think of how the seasons change within each year, this indeed makes a lot of sense.
3.2.1 Time series and decomposing
# Weekly data aggregated
building_2_sun_weekly <- building_2_sun[, .(weekly_sum=sum(sun)),
by=.(Week=floor_date(timestamp, "weeks"))]
building_2_weekly <- building_2[, .(weekly_sum=sum(energy_produced)),
by=.(Week=floor_date(timestamp, "weeks"))]
building_5_sun_weekly <- building_5_sun[, .(weekly_sum=sum(sun)),
by=.(Week=floor_date(timestamp, "weeks"))]
building_5_weekly <- building_5[, .(weekly_sum=sum(energy_produced)),
by=.(Week=floor_date(timestamp, "weeks"))]
building_8_sun_weekly <- building_8_sun[, .(weekly_sum=sum(sun)),
by=.(Week=floor_date(timestamp, "weeks"))]
building_8_weekly <- building_8[, .(weekly_sum=sum(energy_produced)),
by=.(Week=floor_date(timestamp, "weeks"))]

Here we create time series objects out of the weekly aggregated data and decompose them in order to see their components (data, trend, seasonal component and remainder).
building_2_weekly_ts = ts(building_2_weekly[, "weekly_sum"], frequency = 365.25/7, start = c(2016,8))
autoplot(decompose(building_2_weekly_ts, type="additive"))
building_5_weekly_ts = ts(building_5_weekly[, "weekly_sum"], frequency = 365.25/7, start = c(2016,8))
autoplot(decompose(building_5_weekly_ts, type="additive"))
building_8_weekly_ts = ts(building_8_weekly[, "weekly_sum"], frequency = 365.25/7, start = c(2016,8))
autoplot(decompose(building_8_weekly_ts, type="additive"))
The seasonal component shows how the data returns to the same level in the same week of every year; the future models will largely build upon this component.
The remainders (noise) of these time series appear to be approximately normally distributed (white noise), which indicates that the models can potentially learn from the data.
The trends of the time series are not that smooth in general, which means the predictive power of the trend alone might be quite low; however, by combining the seasonal and trend components the predictive power should still increase, since these two are the most important components for the future models.
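The white-noise impression of the remainder can be checked more formally; as a sketch, a Ljung-Box test on the remainder of the building 2 decomposition (using only objects defined above):

```r
# test the decomposition remainder for residual autocorrelation;
# a large p-value is consistent with white noise
dec_b2 <- decompose(building_2_weekly_ts, type = "additive")
Box.test(na.omit(dec_b2$random), lag = 20, type = "Ljung-Box")
```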
3.2.2 Checking for stationarity
Stationarity basically means that the statistical properties of the time series do not change over time, which is important for many analytical tools and for the modeling later on.
adf.test(building_2_weekly_ts)

##
## Augmented Dickey-Fuller Test
##
## data: building_2_weekly_ts
## Dickey-Fuller = -3.4459, Lag order = 5, p-value = 0.04975
## alternative hypothesis: stationary
adf.test(building_5_weekly_ts)

##
## Augmented Dickey-Fuller Test
##
## data: building_5_weekly_ts
## Dickey-Fuller = -2.9694, Lag order = 5, p-value = 0.1728
## alternative hypothesis: stationary
adf.test(building_8_weekly_ts)

##
## Augmented Dickey-Fuller Test
##
## data: building_8_weekly_ts
## Dickey-Fuller = -2.6659, Lag order = 5, p-value = 0.2989
## alternative hypothesis: stationary
The H0 here is that the time series is non-stationary, and the alternative hypothesis is that it is stationary. Since the p-value is below 0.05 only for building 2, we can reject H0 for that series and conclude that it is stationary.
For buildings 5 and 8, however, the p-values are above 0.05. As the nature of the three buildings is similar and we are going to use data from other sources anyway, we will try out modeling with the non-stationary data and see how it performs.
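Should stationarity be needed later, differencing is the standard remedy; a hedged sketch using ndiffs() from the forecast package (already used elsewhere in this report):

```r
# estimate the number of differences needed, difference the series,
# and re-run the ADF test on the result
d <- ndiffs(building_5_weekly_ts)
b5_diff <- diff(building_5_weekly_ts, differences = max(d, 1))
adf.test(b5_diff)
```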
3.3 Modeling
For the forecasting we are going to try out different models first on one building, which is building 2. The models we are going to compare are:
- basic arima
- sarima
- holt winters
- arima with external regressors
After we have seen how they perform on one building, we will look at the residuals to see which one performs best over a longer time span. The best model is then used to forecast the energy production of the other buildings as well.
3.3.0.1 Split into train/test
# building 2
training_b2 <- window(building_2_ts, start=c(2016,8),end= c(2018,11))
testing_b2 <- window(building_2_ts, start=c(2018,12) )
# building 5
training_b5 <- window(building_5_ts, start=c(2016,8),end= c(2018,11))
testing_b5 <- window(building_5_ts, start=c(2018,12) )
# building 8
training_b8 <- window(building_8_ts, start=c(2016,8),end= c(2018,11))
testing_b8 <- window(building_8_ts, start=c(2018,12))

3.3.1 Arima
We first try out the most basic arima model on building 2:
arima_basic <- auto.arima(training_b2[,"energy"])
plot(forecast(arima_basic, h=52))
lines(testing_b2[,"energy"], col="red")
It is not too bad, but also not particularly good: there is a point at which the model is completely unable to capture the amount of sun.
3.3.2 Sarima
We try out a Sarima model:
sarima_forecast <- sarima.for(training_b2[,"energy"], n.ahead = length(testing_b2[,"energy"]),
p=0,d=1,q=1, P=1, D=1, Q=0, S=12)
3.3.3 Holt winters
# Building 2
autoplot(building_2_ts[,"energy"])
building_2_ts.fc <- HoltWinters(training_b2[,"energy"])
#building_2_weekly_ts.fc$fitted
plot(building_2_ts.fc)
plot(forecast(building_2_ts.fc, h=52))
lines(testing_b2[,"energy"], col="red")
3.3.4 Arima including external regressors
We decide to also include other variables in our model. As discussed in the Feature Engineering section, we chose sun, sunHour, uvIndex and cloudcover, as we believe they may be relevant for our prediction.
arima_b2 <- auto.arima(training_b2[,"energy"], xreg=training_b2[,c("sun","sunHour", "uvIndex", "cloudcover")])
plot(forecast(arima_b2, xreg=testing_b2[,c("sun", "sunHour", "uvIndex", "cloudcover")]))
lines(testing_b2[,"energy"], col="red")
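To compare the candidates quantitatively rather than only visually, test-set accuracy measures can be computed; a minimal sketch assuming the fitted objects above (the sarima.for fit is omitted, as it returns predictions directly rather than a forecast object):

```r
# compare test-set error measures (RMSE, MAE, ...) of the candidate models
h <- length(testing_b2[, "energy"])
accuracy(forecast(arima_basic, h = h), testing_b2[, "energy"])
accuracy(forecast(building_2_ts.fc, h = h), testing_b2[, "energy"])
accuracy(forecast(arima_b2, xreg = testing_b2[, c("sun", "sunHour", "uvIndex", "cloudcover")]),
         testing_b2[, "energy"])
```

The model with the lowest test-set RMSE would then be the natural candidate to roll out to the other buildings.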